Large Alphabets and Incompressibility 



Travis Gagie 

Department of Computer Science 
University of Toronto 

Abstract

We briefly survey some concepts related to empirical entropy (normal numbers, de
Bruijn sequences and Markov processes) and investigate how well it approximates
Kolmogorov complexity. Our results suggest $\ell$th-order empirical entropy stops being
a reasonable complexity metric for almost all strings of length $m$ over alphabets of
size $n$ about when $n^\ell$ surpasses $m$.

1 Introduction

For data compression, machine learning and cryptanalysis, we often want to
know the Kolmogorov complexity $K(S)$ [23,13,4,15] of a string $S$, that is, the
minimum space needed to store $S$. It is formally defined as the length in
bits of the shortest program that outputs $S$. Notice our choice of programming
language does not affect this length by more than an additive constant,
provided it is Turing-equivalent; for example, the length of the shortest such
FORTRAN program exceeds the length of the shortest such LISP program by no
more than the length of the shortest LISP interpreter written in FORTRAN,
which does not depend on $S$. Unfortunately, a simple diagonalization shows
Kolmogorov complexity is incomputable: Given a program $A$ for computing
Kolmogorov complexity, we could write a program $B$ that searches until it
finds and outputs a string $S$ with $A(S) = K(S)$ greater than $B$'s length in
bits, contradicting the definition of $K(S)$. Thus, researchers substitute various
other complexity metrics; in this paper we study one of the most popular,
empirical entropy.

Empirical entropy is rooted in information theory. Let $X$ be a random variable
that takes on one of $n$ values according to $P = p_1, \ldots, p_n$. Shannon [20]
proposed that any function $H(P)$ measuring our uncertainty about $X$ should
have three properties:

(1) "$H$ should be continuous in the $p_i$."

(2) "If all the $p_i$ are equal, $p_i = 1/n$, then $H$ should be a monotonic increasing
function of $n$."

(3) "If a choice be broken down into two successive choices, the original $H$
should be the weighted sum of the individual values of $H$."

He proved the only function with these properties is $H(P) = \sum_{i=1}^{n} p_i \log(1/p_i)$,
which he called the entropy of $P$. The choice of the logarithm's base determines
the unit; by convention, $\log$ means $\log_2$ and the units are bits.

Let $\ell$ be a non-negative integer and suppose $S = s_1 \cdots s_m$. The $\ell$th-order
empirical entropy of $S$ (see, e.g., [16]) is our expected uncertainty about the
random variable $s_i$ given a context of length $\ell$, as in the following experiment:
$i$ is chosen uniformly at random from $\{1, \ldots, m\}$; if $i \le \ell$, then we are told $s_i$;
otherwise, we are told $s_{i-\ell} \cdots s_{i-1}$. Specifically,

\[
H_\ell(S) =
\begin{cases}
\displaystyle \sum_{a \in S} \frac{\#_a(S)}{m} \log \frac{m}{\#_a(S)} & \text{if } \ell = 0, \\[2ex]
\displaystyle \frac{1}{m} \sum_{|\alpha| = \ell} |S_\alpha| \, H_0(S_\alpha) & \text{if } \ell \ge 1.
\end{cases}
\]

In this paper, $a \in S$ means character $a$ occurs in $S$; $\#_a(S)$ is the number
of occurrences of $a$ in $S$; and $S_\alpha$ is the string whose $i$th character is the one
immediately following the $i$th occurrence of string $\alpha$ in $S$. The length of $S_\alpha$
is the number of occurrences of $\alpha$ in $S$, which we denote $\#_\alpha(S)$, unless $\alpha$ is
a suffix of $S$, in which case it is 1 less. We assume $S_\alpha = S$ when $\alpha$ is empty.
Notice $0 \le H_{\ell+1}(S) \le H_\ell(S) \le \log |\{a : a \in S\}|$ for $\ell \ge 0$. For example, if $S$
is the string TORONTO, then

\[
\begin{aligned}
H_0(S) &= \frac{1}{7}\log 7 + \frac{3}{7}\log\frac{7}{3} + \frac{1}{7}\log 7 + \frac{2}{7}\log\frac{7}{2} \approx 1.84 , \\
H_1(S) &= \frac{1}{7}\bigl(H_0(S_{\mathrm{N}}) + 2H_0(S_{\mathrm{O}}) + H_0(S_{\mathrm{R}}) + 2H_0(S_{\mathrm{T}})\bigr) \\
&= \frac{1}{7}\bigl(H_0(\mathrm{T}) + 2H_0(\mathrm{RN}) + H_0(\mathrm{O}) + 2H_0(\mathrm{OO})\bigr) \\
&= 2/7 \approx 0.29
\end{aligned}
\]

and all higher-order empirical entropies of $S$ are 0. This means if someone
chooses a character uniformly at random from TORONTO and asks us to
guess it, then our uncertainty is about 1.84 bits. If they tell us the preceding
character before we guess, then on average our uncertainty is about 0.29 bits;
if they tell us the preceding two characters, then we are certain of the answer.
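
The definition of empirical entropy translates directly into a few lines of code. The following Python sketch is only an illustration under our own naming (the functions `h0` and `empirical_entropy` are hypothetical helpers, not part of any library); it groups the characters of a string by their preceding context and averages the 0th-order entropies of the groups.

```python
from collections import Counter, defaultdict
from math import log2


def h0(s):
    """0th-order empirical entropy of a sequence, in bits per character."""
    m = len(s)
    if m == 0:
        return 0.0
    return sum(k / m * log2(m / k) for k in Counter(s).values())


def empirical_entropy(s, order):
    """(1/m) * sum over contexts alpha of |S_alpha| * H_0(S_alpha)."""
    if not s:
        return 0.0
    if order == 0:
        return h0(s)
    followers = defaultdict(list)  # context alpha -> characters of S_alpha
    for i in range(order, len(s)):
        followers[s[i - order:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


for order in range(3):
    print(order, round(empirical_entropy("TORONTO", order), 2))
# prints 0 1.84, then 1 0.29, then 2 0.0
```

The three printed values agree with the 0th-, 1st- and 2nd-order empirical entropies of TORONTO computed by hand.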

Empirical entropy has a surprising connection to number theory. Let $(x)_{n,m}$
denote the first $m$ digits of the number $x$ in base $n \ge 2$. Borel [2] called $x$
normal in base $n$ if, for $\alpha \in \{0, \ldots, n-1\}^*$, $\lim_{m \to \infty} \frac{\#_\alpha((x)_{n,m})}{m} = 1/n^{|\alpha|}$. For
example, the Champernowne constant [5] and Copeland-Erdős constant [6],
$0.123456789101112\ldots$ and $0.23571113171923\ldots$, are normal in base
10. Notice $x$ being normal in base $n$ is equivalent to $\lim_{m \to \infty} H_\ell((x)_{n,m}) = \log n$
for $\ell \ge 0$. Borel called $x$ absolutely normal if it is normal in all bases. He proved
almost all numbers are absolutely normal but Sierpiński [21] was the first to
find an example, which is still not known to be computable. Turing [24] claimed
there exist computable absolutely normal numbers but this was only verified
recently, by Becher and Figueira [1]. Such numbers' representations have finite
Kolmogorov complexity yet look random if we consider only empirical entropy,
regardless of base and order. Of course, we are sometimes fooled whatever
computable complexity metric we consider.
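
To make the gap between empirical entropy and Kolmogorov complexity concrete, the sketch below generates a prefix of the Champernowne constant with a very short program and measures its empirical entropy. The helper names and the prefix length of 100000 digits are our own choices for illustration; convergence to $\log 10 \approx 3.32$ is slow, so the measured values are only close to that limit.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy(s, order):
    """order-th empirical entropy of s in bits per character."""
    followers = defaultdict(Counter)  # context -> counts of following characters
    for i in range(order, len(s)):
        followers[s[i - order:i]][s[i]] += 1
    bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        bits += sum(k * log2(size / k) for k in counts.values())
    return bits / len(s)


def champernowne_digits(m):
    """First m decimal digits after the point of 0.123456789101112..."""
    parts, total, k = [], 0, 1
    while total < m:
        parts.append(str(k))
        total += len(parts[-1])
        k += 1
    return "".join(parts)[:m]


prefix = champernowne_digits(100_000)
for order in range(3):
    print(order, empirical_entropy(prefix, order), "limit:", log2(10))
```

The generator is a few lines long, so the prefix has small Kolmogorov complexity even though its low-order empirical entropies are near the maximum for a decimal string.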

Now consider de Bruijn sequences [7] from combinatorics. An $n$-ary linear
de Bruijn sequence of order $\ell$ is a string over $\{0, \ldots, n-1\}$ containing
every possible $\ell$-tuple exactly once. For example, the binary linear de Bruijn
sequences of order 3 are the 16 10-bit substrings of 00010111000101110 and
its reverse: 0001011100, ..., 1000101110, 0111010001, ..., 0011101000. By
definition, such strings have length $n^\ell + \ell - 1$ and $\ell$th-order empirical entropy 0
(but $(\ell-1)$st-order empirical entropy $\frac{n^\ell}{n^\ell + \ell - 1} \log n$). However, Rosenfeld [19]
showed there are $(n!)^{n^{\ell-1}}$ of them. It follows that one randomly chosen has
expected Kolmogorov complexity in $\Theta\bigl(\log (n!)^{n^{\ell-1}}\bigr) = \Theta(n^\ell \log n)$; whereas
Borel's normal numbers can be much less complex than empirical entropy
suggests, de Bruijn sequences can be much more complex.
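
The binary example of order 3 is small enough to check exhaustively. The sketch below is a brute-force illustration, not an efficient construction: `linear_de_bruijn` is a hypothetical helper that tries every string of the right length and assumes $n \le 10$ so that characters can be written as single digits.

```python
from collections import Counter, defaultdict
from itertools import product
from math import log2


def empirical_entropy(s, order):
    """order-th empirical entropy of s in bits per character."""
    followers = defaultdict(Counter)
    for i in range(order, len(s)):
        followers[s[i - order:i]][s[i]] += 1
    bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        bits += sum(k * log2(size / k) for k in counts.values())
    return bits / len(s)


def linear_de_bruijn(n, order):
    """All strings of length n**order + order - 1 whose windows of length
    `order` are pairwise distinct (brute force; assumes n <= 10)."""
    length = n ** order + order - 1
    alphabet = "".join(str(d) for d in range(n))
    found = []
    for chars in product(alphabet, repeat=length):
        s = "".join(chars)
        windows = {s[i:i + order] for i in range(n ** order)}
        if len(windows) == n ** order:
            found.append(s)
    return found


seqs = linear_de_bruijn(2, 3)
print(len(seqs))  # 16
print(empirical_entropy(seqs[0], 3), empirical_entropy(seqs[0], 2))
```

For every one of the 16 sequences the 3rd-order empirical entropy is 0, while the 2nd-order empirical entropy is $8/10$ of $\log 2 = 1$ bit.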

Empirical entropy also has connections to algorithm design. For example,
Munro and Spira [18] used 0th-order empirical entropy to analyze several
sorting algorithms and Sleator and Tarjan [22] used it in the Static
Optimality Theorem: Suppose we perform a sequence of $m$ operations on a
splay-tree, with $s_i$ being the target of the $i$th operation; if $S = s_1 \cdots s_m$
includes every key in the tree, then we use $O((H_0(S) + 1)m)$ time. Of course,
most of the algorithms analyzed in terms of empirical entropy are for data
compression. Manzini's analysis [16] of the Burrows-Wheeler Transform [3]
is particularly interesting. He proved an algorithm based on the Transform
stores any string $S$ of length $m$ over an alphabet of size $n$ in at most about
$(8H_\ell(S) + 1/20)m + n^\ell(2n \log n + 9)$ bits, for all $\ell \ge 0$ simultaneously.
Subsequent research by Ferragina, Manzini, Mäkinen and Navarro [8], for example,
has shown that if $n^{\ell+1} \log m \in o(m \log n)$, then we can store an efficient index
for $S$ in $(H_\ell(S) + o(\log n))m$ bits. Notice we cannot lift the restriction on $n$ and
$\ell$ to $n^\ell \in O(m)$: If $S$ is a randomly chosen $n$-ary linear de Bruijn sequence of
order $\ell$, then $m = n^\ell + \ell - 1$ and $H_\ell(S) = 0$, so $(cH_\ell(S) + o(\log n))m = o(n^\ell \log n)$
for any $c$, but $K(S) \in \Theta(n^\ell \log n)$ in the expected case.

In this paper we investigate further the relationship between the order $\ell$,
the alphabet size $n$ and the string length $m$. Our results suggest $\ell$th-order
empirical entropy stops being a reasonable complexity metric for almost all
strings about when $n^\ell$ surpasses $m$. For simplicity, we assume $\ell$ and $n$ are
given to us as (possibly constant) functions from $m$ to the positive integers and
consider $S \in \{1, \ldots, n\}^m$. In Section 2 we prove that, for any fixed $c > 1$ and
$\epsilon > 0$, if $n^{\ell+1/c} \log n \in o(m)$ and $m$ is sufficiently large, then $K(S) < (cH_\ell(S) +
\epsilon)m$. We use a new upper bound for compressing probability distributions,
which extends our results from [9] and may be of independent interest. In
Section 3 we prove that if $\epsilon < 1/c$, $\ell$ is fixed, $n^{\ell+1/c-\epsilon} \in \Omega(m)$ and $m$ is
sufficiently large, then $K(S) > (cH_\ell(S) + \frac{\epsilon}{3} \log n)m$ with high probability for
randomly chosen $S$. As a corollary we prove a nearly matching lower bound
for compressing probability distributions.

It seems interesting that slightly changing the relationship between $\ell$, $n$ and
$m$ can change $(cH_\ell(S) + o(\log n))m$ from an upper bound on $K(S)$ to an
almost certain lower bound. Phenomena like this one, in which small shifts in
parameters change a property asymptotically from very likely to very unlikely,
are called threshold phenomena; they are common and well-studied in several
disciplines (see, e.g., [12]) but we know of no others related to data
compression. Although our proof of a threshold phenomenon requires $\ell$ to be fixed,
we emphasize it holds for any constant coefficient $c > 1$ before $H_\ell(S)$ and any
$o(\log n)$ second term in the formula.



2 Upper bounds 

We first rephrase the definition of empirical entropy: For $\ell \ge 0$, the $\ell$th-order
empirical entropy of a string $S$ is the minimum self-information per character
of $S$ emitted by an $\ell$th-order Markov process. The self-information of an
event with probability $p$ is $\log(1/p)$. An $\ell$th-order Markov process is a string of
random variables in which each variable depends only on at most $\ell$ immediate
predecessors (see, e.g., [20]); a process is said to emit the values of its
variables. We use relative entropy [14], also called the Kullback-Leibler distance,
to prove the two definitions equivalent. Let $P = p_1, \ldots, p_n$ and $Q = q_1, \ldots, q_n$
be probability distributions over $\{1, \ldots, n\}$; the relative entropy between $P$
and $Q$, $D(P\|Q) = \sum_{i=1}^{n} p_i \log(p_i/q_i)$, is often used in information theory to
measure how well $Q$ approximates $P$. Although relative entropy is not a
distance metric (it is not symmetric and does not obey the triangle inequality),
it is 0 when $P = Q$ and positive otherwise.

Theorem 1 For any string $S \in \{1, \ldots, n\}^m$ and $\ell \ge 0$, we have
$H_\ell(S) = \frac{1}{m} \min \{\log(1/\Pr[Q \text{ emits } S]) : Q \text{ is an } \ell\text{th-order Markov process}\}$.



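PROOF. Consider the probability an $\ell$th-order Markov process $Q$ emits $S$.
Assume, without loss of generality, that $Q$ first emits $s_1 \cdots s_\ell$ with probability
1. For $\alpha \in \{1, \ldots, n\}^\ell$, let $P_\alpha = p_{\alpha,1}, \ldots, p_{\alpha,n}$ be the normalized distribution
of the characters in $S_\alpha$, so $H(P_\alpha) = H_0(S_\alpha)$; let $Q_\alpha = q_{\alpha,1}, \ldots, q_{\alpha,n}$, where $q_{\alpha,a}$
is the probability $Q$ emits $a$ immediately after an occurrence of $\alpha$. Then

\[
\begin{aligned}
\log \frac{1}{\Pr[Q \text{ emits } S]}
&= \log \prod_{i=\ell+1}^{m} \frac{1}{q_{s_{i-\ell} \cdots s_{i-1},\, s_i}} \\
&= \sum_{i=\ell+1}^{m} \log \frac{1}{q_{s_{i-\ell} \cdots s_{i-1},\, s_i}} \\
&= \sum_{|\alpha|=\ell} \sum_{a \in S_\alpha} \#_a(S_\alpha) \log \frac{1}{q_{\alpha,a}} \\
&= \sum_{|\alpha|=\ell} |S_\alpha| \sum_{a \in S_\alpha} p_{\alpha,a} \left( \log \frac{p_{\alpha,a}}{q_{\alpha,a}} + \log \frac{1}{p_{\alpha,a}} \right) \\
&= \sum_{|\alpha|=\ell} |S_\alpha| \bigl( D(P_\alpha \| Q_\alpha) + H(P_\alpha) \bigr) \\
&\ge \sum_{|\alpha|=\ell} |S_\alpha| \, H(P_\alpha) \\
&= H_\ell(S)\, m ,
\end{aligned}
\]

with equality throughout if $P_\alpha = Q_\alpha$ for $\alpha \in \{1, \ldots, n\}^\ell$. □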
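
Theorem 1 is easy to check numerically on a small string. In the sketch below, all function names are our own and the "smoothed" process is an arbitrary competitor chosen only for comparison; the process whose transition probabilities equal the empirical frequencies attains self-information $H_\ell(S)\,m$, and the competitor does worse.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_process(s, order):
    """Transition probabilities q[alpha][a] equal to the frequencies in S_alpha."""
    counts = defaultdict(Counter)
    for i in range(order, len(s)):
        counts[s[i - order:i]][s[i]] += 1
    return {alpha: {a: k / sum(c.values()) for a, k in c.items()}
            for alpha, c in counts.items()}


def smoothed_process(s, order, weight=0.5):
    """A hypothetical competitor: each empirical row mixed with the uniform
    distribution over the characters that occur in s."""
    alphabet = sorted(set(s))
    return {alpha: {a: (1 - weight) * row.get(a, 0.0) + weight / len(alphabet)
                    for a in alphabet}
            for alpha, row in empirical_process(s, order).items()}


def self_information(s, order, q):
    """log(1 / Pr[Q emits s]), with the first `order` characters emitted
    with probability 1 as in the proof."""
    return sum(log2(1 / q[s[i - order:i]][s[i]]) for i in range(order, len(s)))


s = "TORONTO"
best = self_information(s, 1, empirical_process(s, 1))
other = self_information(s, 1, smoothed_process(s, 1))
print(best, other)  # best is H_1(S) * m = (2/7) * 7 = 2 bits
```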



We now consider how compactly we can store probability distributions, Markov 
processes and, ultimately, strings. 

Lemma 2 Fix $c > 1$ and $\epsilon > 0$ and let $P = p_1, \ldots, p_n$ be a probability
distribution over $\{1, \ldots, n\}$. For some probability distribution $Q$ with $D(P\|Q) <
(c - 1)H(P) + \epsilon$, storing $Q$ takes $O(n^{1/c} \log n)$ bits.



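PROOF. Let $t \le r n^{1/c}$ be the number of probabilities in $P$ that are at least
$\frac{1}{r n^{1/c}}$, where $r = 1 + \frac{1}{2^{\epsilon/2} - 1}$. For each such $p_i$, we record $i$ and $\lfloor p_i r^2 n \rfloor$. Since $r$
depends only on $\epsilon$, which is fixed, in total we use $O(n^{1/c} \log n)$ bits. This
information lets us later recover $Q = q_1, \ldots, q_n$, where

\[
q_i =
\begin{cases}
\displaystyle \left(1 - \frac{1}{r}\right) \frac{\lfloor p_i r^2 n \rfloor}{\sum \bigl\{ \lfloor p_j r^2 n \rfloor : p_j \ge \frac{1}{r n^{1/c}} \bigr\}} & \text{if } p_i \ge \frac{1}{r n^{1/c}}, \\[3ex]
\displaystyle \frac{1}{r(n - t)} & \text{otherwise.}
\end{cases}
\]

Suppose $p_i \ge \frac{1}{r n^{1/c}}$; then $p_i r^2 n \ge r$. Since $\sum \bigl\{ \lfloor p_j r^2 n \rfloor : p_j \ge \frac{1}{r n^{1/c}} \bigr\} \le r^2 n$,

\[
p_i \log \frac{p_i}{q_i} \le p_i \log \left( \frac{r}{r - 1} \cdot \frac{p_i r^2 n}{\lfloor p_i r^2 n \rfloor} \right) < 2 p_i \log \frac{r}{r - 1} = p_i \epsilon .
\]

Now suppose $p_i < \frac{1}{r n^{1/c}}$; then $p_i \log(1/p_i) > \frac{p_i}{c} \log n$. Thus,

\[
p_i \log \frac{p_i}{q_i} < p_i \log \frac{r(n - t)}{r n^{1/c}} \le \frac{(c - 1) p_i}{c} \log n < (c - 1)\, p_i \log \frac{1}{p_i} .
\]

Finally, since $p \log(1/p) \ge 0$ for $p \le 1$, we have

\[
D(P\|Q) < \sum \left\{ (c - 1)\, p_i \log \frac{1}{p_i} : p_i < \frac{1}{r n^{1/c}} \right\} + \epsilon \le (c - 1)H(P) + \epsilon .
\]

□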
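
The construction in the proof of Lemma 2 can be tried out directly. The sketch below is an illustration under assumptions of ours: the value $r = 1 + 1/(2^{\epsilon/2} - 1)$ is chosen so that $2\log(r/(r-1)) = \epsilon$, the test distribution is an arbitrary Zipf-like one, the function names `store` and `recover` are hypothetical, and we assume some but not all probabilities reach the threshold.

```python
from math import floor, log2


def entropy(p):
    return sum(x * log2(1 / x) for x in p if x > 0)


def relative_entropy(p, q):
    return sum(x * log2(x / y) for x, y in zip(p, q) if x > 0)


def store(p, c, eps):
    """Record floor(p_i * r^2 * n) for every p_i at or above 1 / (r n^(1/c))."""
    n = len(p)
    r = 1 + 1 / (2 ** (eps / 2) - 1)  # assumed: gives 2 log(r / (r - 1)) = eps
    threshold = 1 / (r * n ** (1 / c))
    stored = {i: floor(x * r * r * n) for i, x in enumerate(p) if x >= threshold}
    return stored, r


def recover(stored, r, n):
    """Rebuild Q from the stored pairs; assumes 0 < len(stored) < n."""
    t, total = len(stored), sum(stored.values())
    return [(1 - 1 / r) * stored[i] / total if i in stored else 1 / (r * (n - t))
            for i in range(n)]


n, c, eps = 1000, 2, 0.5
weights = [1 / (i + 1) for i in range(n)]  # Zipf-like test distribution
norm = sum(weights)
p = [w / norm for w in weights]
stored, r = store(p, c, eps)
q = recover(stored, r, n)
print(len(stored), "of", n, "probabilities stored")
print(relative_entropy(p, q), "<", (c - 1) * entropy(p) + eps)
```

Only the few large probabilities are recorded; all the others share the remaining mass $1/r$ equally, which is what keeps the description short.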



Corollary 3 Fix $c > 1$ and $\epsilon > 0$ and consider a string $S \in \{1, \ldots, n\}^m$. For
some $\ell$th-order Markov process $Q$ with $\log(1/\Pr[Q \text{ emits } S]) < (cH_\ell(S) + \epsilon)m$,
storing $Q$ takes $O(n^{\ell+1/c} \log n)$ bits.



PROOF. First we store $s_1 \cdots s_\ell$. For $\alpha \in \{1, \ldots, n\}^\ell$, let $P_\alpha = p_{\alpha,1}, \ldots, p_{\alpha,n}$
be the normalized distribution of characters in $S_\alpha$ and let $Q_\alpha = q_{\alpha,1}, \ldots, q_{\alpha,n}$
be the probability distribution with $D(P_\alpha \| Q_\alpha) < (c - 1)H(P_\alpha) + \epsilon$ obtained
from applying Lemma 2 to $c$, $\epsilon$ and $P_\alpha$. We store every $Q_\alpha$, using a total of
$O(n^{\ell+1/c} \log n)$ bits.

This information lets us later recover a Markov process $Q$ that first emits
$s_1 \cdots s_\ell$ and in which, for $\alpha \in \{1, \ldots, n\}^\ell$ and $a \in \{1, \ldots, n\}$, the probability
$a$ is emitted immediately after an occurrence of $\alpha$ is $q_{\alpha,a}$. As in the proof of
Theorem 1, $\log(1/\Pr[Q \text{ emits } S]) = \sum_{|\alpha|=\ell} |S_\alpha| \bigl( D(P_\alpha \| Q_\alpha) + H(P_\alpha) \bigr)$, so

\[
\log \frac{1}{\Pr[Q \text{ emits } S]} < \sum_{|\alpha|=\ell} |S_\alpha| \bigl( cH(P_\alpha) + \epsilon \bigr) \le (cH_\ell(S) + \epsilon)m .
\]

□

We note that, given a string $S \in \{1, \ldots, n\}^m$, we can store an $\ell$th-order Markov
process $Q$ with $\log(1/\Pr[Q \text{ emits } S]) = H_\ell(S)\,m$ in $O(n^{\ell+1} \log(m + 1))$ bits,
as a table containing $\#_a(S_\alpha) = \#_{\alpha a}(S) \le m$ for $\alpha a \in \{1, \ldots, n\}^{\ell+1}$. Grossi,
Gupta and Vitter [10] investigated the space needed for such a table; they also
showed that, apart from the cost of storing the table, we can store $S$ in $H_\ell(S)\,m$
bits. However, because we do not see how to store the table in less space when
there is a constant coefficient $c > 1$ before $H_\ell(S)$, we tolerate the $\epsilon$ term in
Corollary 3 and the following theorem.

Theorem 4 Fix $c > 1$ and $\epsilon > 0$ and let $\ell$ and $n$ be functions from $m$ to the
positive integers. Consider a string $S \in \{1, \ldots, n\}^m$. If $n^{\ell+1/c} \log n \in o(m)$
and $m$ is sufficiently large, then $K(S) < (cH_\ell(S) + \epsilon)m$.



PROOF. By Corollary 3, since $n^{\ell+1/c} \log n \in o(m)$ and $m$ is sufficiently large,
we can store an $\ell$th-order Markov process $Q$ with $\log(1/\Pr[Q \text{ emits } S]) <
(cH_\ell(S) + \epsilon/2)m$ in $\epsilon m/2 - 1$ bits. Shannon [20] showed how, given $Q$, we can
store $S$ in $\lceil \log(1/\Pr[Q \text{ emits } S]) \rceil$ bits. Thus, we can store $Q$ and $S$ together
in fewer than $(cH_\ell(S) + \epsilon)m$ bits. □
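
The trade-off behind Theorem 4 is between the bits spent on the model and the bits spent on the string given the model. The sketch below illustrates this with a simplification of ours: instead of the approximate process of Corollary 3, it charges for an exact table of $n^{\ell+1}$ counts at $\lceil \log(m+1) \rceil$ bits each, plus $\lceil H_\ell(S)\,m \rceil$ bits for the string. The function name `two_part_bits` is hypothetical.

```python
from collections import Counter, defaultdict
from math import ceil, log2


def empirical_entropy(s, order):
    """order-th empirical entropy of s in bits per character."""
    followers = defaultdict(Counter)
    for i in range(order, len(s)):
        followers[s[i - order:i]][s[i]] += 1
    bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        bits += sum(k * log2(size / k) for k in counts.values())
    return bits / len(s)


def two_part_bits(s, order, n):
    """(table bits, string bits) for an exact table of (order+1)-tuple counts."""
    m = len(s)
    table_bits = n ** (order + 1) * ceil(log2(m + 1))
    string_bits = ceil(empirical_entropy(s, order) * m)
    return table_bits, string_bits


s = "TORONTO" * 50  # m = 350 over an alphabet of size n = 4
for order in range(4):
    print(order, two_part_bits(s, order, n=4))
```

As the order grows, the second component shrinks but the first grows by a factor of $n$ each time, so the total is only a useful bound while $n^{\ell+1}$ stays small relative to $m$.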



3 Lower bounds 



Consider the so-called birthday paradox: If we draw $m$ times from $\{1, \ldots, n\}$,
then the probability at least two of the numbers drawn are the same is about
$1 - 1/e^{m^2/(2n)}$. Thus, for $\ell \ge 1$, if $n^{1/2} \in \omega(m)$ and $S$ is chosen randomly,
then with high probability $H_\ell(S) = 0$ because no character appears more
than once in $S$. (Notice also $H_0(S) \le \log m < \log(n)/2$ for sufficiently large
$m$.) Thus, we cannot lift the restriction on $n$ and $\ell$ in Theorem 4 to $n^{1/2-\epsilon} \in
O(m)$. We use a similar but more complicated argument to show we cannot
even lift the restriction to $n^{\ell+1/c-\epsilon} \in O(m)$. Essentially, we use a Chernoff
bound on the probability of there being any frequent $\ell$-tuples in $S$. Since
the probability of an $\ell$-tuple occurring somewhere in $S$ depends on whether
it occurs in neighbouring positions, we apply the following intuitive lemma
(proven in, e.g., [17]) before we apply the Chernoff bound.
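
The birthday argument can be checked with a short experiment. In the sketch below the parameters ($m = 1000$, $n = 10^{12}$, a fixed random seed) and the helper names are our own choices; it compares the exact collision probability with the approximation $1 - e^{-m^2/(2n)}$ and then measures the empirical entropy of one random string over a very large alphabet.

```python
import random
from collections import Counter, defaultdict
from math import exp, log2


def collision_probability(m, n):
    """Exact probability that m uniform draws from n values contain a repeat."""
    no_repeat = 1.0
    for i in range(m):
        no_repeat *= 1 - i / n
    return 1 - no_repeat


def empirical_entropy(s, order):
    """order-th empirical entropy of a sequence in bits per character."""
    followers = defaultdict(Counter)
    for i in range(order, len(s)):
        followers[tuple(s[i - order:i])][s[i]] += 1
    bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        bits += sum(k * log2(size / k) for k in counts.values())
    return bits / len(s)


print(collision_probability(23, 365), 1 - exp(-23 ** 2 / (2 * 365)))

m, n = 1000, 10 ** 12  # n^(1/2) is much larger than m
rng = random.Random(0)
s = [rng.randrange(n) for _ in range(m)]
print(collision_probability(m, n))
print(empirical_entropy(s, 1), empirical_entropy(s, 0), log2(m), log2(n))
```

With these parameters a repeated character is very unlikely, so the 1st-order empirical entropy is essentially 0 and the 0th-order empirical entropy is at most $\log m$, far below the roughly $\log n$ bits per character needed to describe a typical such string.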



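Lemma 5 Let $X_1, \ldots, X_m$ be binary random variables such that, for $1 \le i \le m$
and $b \in \{0, 1\}^{i-1}$, $\Pr[X_i = 1 \mid X_1 \cdots X_{i-1} = b] \le p$. Let $Y_1, \ldots, Y_m$ be
independent binary random variables, each equal to 1 with probability $p$. For
$0 \le q \le 1$,

\[
\Pr\left[ \sum_{j=1}^{m} X_j \ge qm \right] \le \Pr\left[ \sum_{j=1}^{m} Y_j \ge qm \right] .
\]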

Theorem 6 Fix $c > 1$, $\epsilon$ with $0 < \epsilon < 1/c$ and $\ell \ge 1$ and let $n$ be a
function from $m$ to the positive integers. Choose a string $S \in \{1, \ldots, n\}^m$
uniformly at random. If $n^{\ell+1/c-\epsilon} \in \Omega(m)$ and $m$ is sufficiently large, then
$K(S) > (cH_\ell(S) + \frac{\epsilon}{3} \log n)m$ with high probability.



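PROOF. Since there are $n^m$ choices for $S$ and only

\[
\sum \bigl\{ 2^i : 0 \le i \le \lfloor (1 - \epsilon/3) m \log n \rfloor \bigr\} \le 2 n^{(1 - \epsilon/3)m} - 1
\]

binary strings of length at most $(1 - \epsilon/3) m \log n$, we have $K(S) > (1 - \epsilon/3)
m \log n$ with probability greater than $1 - 2/n^{\epsilon m/3}$. Thus, we need only show
$cH_\ell(S) < (1 - 2\epsilon/3) \log n$ with high probability. By definition,

\[
\begin{aligned}
H_\ell(S)
&\le \max_{|\alpha|=\ell} \{ H_0(S_\alpha) \} \\
&\le \max_{|\alpha|=\ell} \{ \log |\{a : a \in S_\alpha\}| \} \\
&\le \max_{|\alpha|=\ell} \{ \log ( |\{a : a \in S_\alpha, a \notin \alpha\}| + \ell ) \} .
\end{aligned}
\]

Notice $n \in \omega\bigl(m^{1/(\ell + 1/c)}\bigr)$. We will show

\[
\Pr\Bigl[ |\{a : a \in S_\alpha, a \notin \alpha\}| \ge n^{1/c - 2\epsilon/3} - \ell \Bigr] \le \frac{1}{2^{n^{\epsilon/3} - \ell}}
\]

for each $\alpha \in \{1, \ldots, n\}^\ell$, so

\[
\begin{aligned}
\Pr\Bigl[ \max_{|\alpha|=\ell} \{ |\{a : a \in S_\alpha, a \notin \alpha\}| \} \ge n^{1/c - 2\epsilon/3} - \ell \Bigr] &\le \frac{n^\ell}{2^{n^{\epsilon/3} - \ell}} , \\
\Pr\Bigl[ \max_{|\alpha|=\ell} \{ \log ( |\{a : a \in S_\alpha, a \notin \alpha\}| + \ell ) \} \ge \Bigl( \frac{1}{c} - \frac{2\epsilon}{3} \Bigr) \log n \Bigr] &\le \frac{n^\ell}{2^{n^{\epsilon/3} - \ell}}
\end{aligned}
\]

and $cH_\ell(S) < (1 - 2\epsilon c/3) \log n < (1 - 2\epsilon/3) \log n$ with high probability.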



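Consider $\alpha \in \{1, \ldots, n\}^\ell$. Let $X_1, \ldots, X_{m-\ell}$ be binary random variables, with
$X_i = 1$ if $s_i \cdots s_{i+\ell-1} = \alpha$ and $s_{i+\ell} \notin \alpha$. Notice $|\{a : a \in S_\alpha, a \notin \alpha\}| + \ell \le
\sum_{i=1}^{m-\ell} X_i + \ell$. For $\ell + 1 \le i \le m - \ell$, by definition, $X_i$ is independent of
$X_1, \ldots, X_{i-\ell-1}$; if any of $X_{i-\ell}, \ldots, X_{i-1}$ are 1, then at least one of $s_i, \ldots, s_{i+\ell-1}$
is not in $\alpha$, so $X_i = 0$; and

\[
\begin{aligned}
\Pr[ X_i = 1 \mid X_{i-\ell} = \cdots = X_{i-1} = 0 ]
&= \frac{\Pr[ X_i = 1 \text{ and } X_{i-\ell} = \cdots = X_{i-1} = 0 ]}{\Pr[ X_{i-\ell} = \cdots = X_{i-1} = 0 ]} \\
&\le \frac{\Pr[X_i = 1]}{1 - \Pr[ X_{i-\ell} = 1 \text{ or } \ldots \text{ or } X_{i-1} = 1 ]} \\
&\le \frac{\Pr[X_i = 1]}{1 - \sum_{j=i-\ell}^{i-1} \Pr[X_j = 1]} \\
&\le \frac{1/n^\ell}{1 - \ell/n^\ell} \\
&= \frac{1}{n^\ell - \ell} .
\end{aligned}
\]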



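Let $Y_1, \ldots, Y_{m-\ell}$ be independent binary random variables, each equal to 1
with probability $p = \frac{1}{n^\ell - \ell}$, and let $q = \frac{n^{1/c - 2\epsilon/3} - \ell}{m - \ell}$. If $q > 1$ the proof is finished,
because $\Pr\bigl[ \sum_{i=1}^{m-\ell} X_i \ge q(m - \ell) \bigr] = 0$; otherwise by Lemma 5,

\[
\Pr\left[ \sum_{i=1}^{m-\ell} X_i \ge q(m - \ell) \right] \le \Pr\left[ \sum_{i=1}^{m-\ell} Y_i \ge q(m - \ell) \right]
\]

and it remains for us to show

\[
\Pr\left[ \sum_{i=1}^{m-\ell} Y_i \ge q(m - \ell) \right] \le \frac{1}{2^{n^{\epsilon/3} - \ell}} .
\]

Since $\ell$ is fixed and $n^{\ell+1/c-\epsilon} \in \Omega(m)$, we have $p(m - \ell) \in O(n^{1/c-\epsilon}) \subset
o(q(m - \ell))$; thus, for sufficiently large $m$, $q(m - \ell) \ge 6p(m - \ell)$ and we
can use the following simple Chernoff bound [11]:

\[
\Pr\left[ \sum_{i=1}^{m-\ell} Y_i \ge q(m - \ell) \right] \le \frac{1}{2^{q(m - \ell)}} = \frac{1}{2^{n^{1/c - 2\epsilon/3} - \ell}} .
\]

Finally, since $\epsilon < 1/c$,

\[
\Pr\left[ \sum_{i=1}^{m-\ell} Y_i \ge q(m - \ell) \right] \le \frac{1}{2^{n^{\epsilon/3} - \ell}} .
\]

□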
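
The last step of the proof relies on a simple Chernoff bound. We assume here that the bound cited from [11] has the standard form $\Pr[Y \ge R] \le 2^{-R}$ for a sum $Y$ of independent binary variables and any $R \ge 6\,\mathrm{E}[Y]$. The sketch below compares that bound with exact binomial tail probabilities for some arbitrary parameters of our choosing; the helper name `binomial_tail` is hypothetical.

```python
from math import comb


def binomial_tail(trials, p, threshold):
    """Pr[Bin(trials, p) >= threshold], summed exactly from the definition."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(threshold, trials + 1))


trials, p = 200, 0.01
mean = trials * p  # expected number of ones
for threshold in (12, 16, 20):  # each is at least 6 * mean
    print(threshold, binomial_tail(trials, p, threshold), 2.0 ** -threshold)
```

In each row the exact tail probability is below $2^{-R}$, as the bound predicts whenever the threshold is at least six times the mean.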

Corollary 7 Fix $c > 1$ and $\epsilon$ with $0 < \epsilon < 1/c$ and let $P$ be a probability
distribution over $\{1, \ldots, n\}$. In the worst case, for any probability distribution
$Q$ with $D(P\|Q) < (c - 1)H(P) + o(\log n)$, storing $Q$ takes $\omega(n^{1/c-\epsilon})$ bits.



PROOF. For the sake of a contradiction, assume there exists an algorithm $A$
that, given any probability distribution $P$ over $\{1, \ldots, n\}$, stores a probability
distribution $Q$ with $D(P\|Q) < (c - 1)H(P) + o(\log n)$ in $O(n^{1/c-\epsilon})$ bits. Then
a proof similar to that of Theorem 4, but substituting $A$ for Lemma 2, yields:

Fix $c > 1$ and $\epsilon$ with $0 < \epsilon < 1/c$ and let $\ell$ and $n$ be functions from $m$ to
the positive integers. Consider a string $S \in \{1, \ldots, n\}^m$. If $n^{\ell+1/c-\epsilon} \in o(m)$,
then $K(S) < (cH_\ell(S) + o(\log n))m$.

Suppose we fix $c$ and $\ell$, choose $\epsilon < 1/c$ and $n$ such that $n^{\ell+1/c-\epsilon} \in o(m)$ but
$n^{\ell+1/c-\epsilon/2} \in \Omega(m)$, and choose a string $S \in \{1, \ldots, n\}^m$ uniformly at random.
The claim above gives $K(S) < (cH_\ell(S) + o(\log n))m$ but by Theorem 6, for
sufficiently large $m$, $K(S) > (cH_\ell(S) + \frac{\epsilon}{6} \log n)m$ with high probability. □



4 Future work 

Suppose we want to store a probability distribution $P$ over a set of strings.
We recently proved that, in theory, if the relative entropy is small between
$P$ and the probability distribution induced by a low-order Markov process $Q$,
then we can store $P$ accurately and efficiently by storing an approximation of
$Q$. We hope experiments will show this technique to be practical.

Our proof of Theorem 6 is slightly complicated because if, for some $\ell$-tuple
$\alpha$, a non-empty string is both a suffix and a prefix of $\alpha$, then occurrences of
$\alpha$ can overlap and any one occurrence increases the probability of others. In
this paper we used the fact that if two occurrences of $\alpha$ overlap, then the first
must be immediately followed by a character in $\alpha$. We recently proved that,
moreover, it must be immediately followed by one of $O(\log \ell)$ characters. We
are now trying to use this result to prove a version of Theorem 6 that does
not require $\ell$ to be fixed.

We are also trying another approach to generalize Theorem 6. Results about 
linear de Bruijn sequences are often proved by considering them as Eulerian 
tours on certain graphs, called de Bruijn graphs. In fact, any string can be 
considered as a walk on a de Bruijn graph; random strings correspond to 
random walks. Since de Bruijn graphs are good expanders, random walks on 
them have properties that may be useful in reasoning about random strings. 